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Abstract: We reveal the phenomenon that "naive" multivariate local polyno- 
mial regression can adapt to local smooth lower dimensional structure in the 
sense that it achieves the optimal convergence rate for nonparametric estima- 
tion of regression functions belonging to a Sobolev space when the predictor 
variables live on or close to a lower dimensional manifold. 

1. Introduction 

It is well known that worst case analysis of multivariate nonparametric regression 
procedures shows that performance deteriorates sharply as dimension increases. 
This is sometimes refered to as the curse of dimensionality. In particular, as initially 
demonstrated by [19|, l20|, if the regression function, m(x), belongs to a Sobolev 
space with smoothness p, there is no nonparametric estimator that can achieve a 
faster convergence rate than n 2 p+ d , where D is the dimensionality of the predictor 
vector X. 

On the other hand, there has recently been a surge in research on identifying 
intrinsic low dimensional structure from a seemingly high dimensional source, see 
P, [TBI, [2l| for instance. In these settings, it is assumed that the observed high- 
dimensional data are lying on a low dimensional smooth manifold. Examples of 
this situation are given in all of these papers — see also [l4|. If we can estimate 
the manifold, we can expect that we should be able to construct procedures which 
perform as well as if we know the structure. Even if the low dimensional structure 
obtains only in a neighborhood of a point, estimation at that point should be gov- 
erned by actual rather than ostensible dimension. In this paper, we shall study this 
situation in the context of nonparametric regression, assuming the predictor vec- 
tor has a lower dimensional smooth structure. We shall demonstrate the somewhat 
surprising phenomenon, suggested by Bickel in his 2004 Rietz lecture, that the pro- 
cedures used with the expectation that the ostensible dimension D is correct will, 
with appropriate adaptation not involving manifold estimation, achieve the optimal 
rate for manifold dimension d. 

Bickel conjectured in his 2004 Rietz lecture that, in predicting Y from X on the 
basis of a training sample, one could automatically adapt to the possibility that 
the apparently high dimensional X that one observed, in fact, lived on a much 
smaller dimensional manifold and that the regression function was smooth on that 
manifold. The degree of adaptation here means that the worst case analyses for 
prediction are governed by smoothness of the function on the manifold and not on 
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the space in which X ostensibly dwells, and that purely data dependent procedures 
can be constructed which achieve the lower bounds in all cases. 

In this paper, we make this statement precise with local polynomial regression. 
Local polynomial regression has been shown to be a useful nonparametric technique 
in various local modelling, see @, 0] . We shall sketch in Section 2 that local linear 
regression achieves this phenomenon for local smoothness p = 2, and will also argue 
that our procedure attains the global IMSE if global smoothness is assumed. We 
shall also sketch how polynomial regression can achieve the appropriate higher rate 
if more smoothness is assumed. 

A critical issue that needs to be faced is regularization since the correct choice 
of bandwidth will depend on the unknown local dimension d{x). Equivalently, we 
need to adapt to d(x). We apply local generalized cross validation, with the help of 
an estimate of d(x) due to 14j. We discuss this issue in Section 3. Finally we give 
some simulations in Section 4. 

A closely related technical report, [2j came to our attention while this paper was 
in preparation. Binev et al consider in a very general way, the construction of non- 
parametric estimation of regression where the predictor variables are distributed 
according to a fixed completely unknown distribution. In particular, although they 
did not consider this possibility, their method covers the case where the distribution 
of the predictor variables is concentrated on a manifold. However, their method is, 
for the moment, restricted to smoothness p < 1 and their criterion of performance 
is the integral of pointwise mean square error with respect to the underlying dis- 
tribution of the variables. Their approach is based on a tree construction which 
implicitly estimates the underlying measure as well as the regression. Our discus- 
sion is considerably more restrictive by applying only to predictors taking values 
in a low dimensional manifold but more general in discussing estimation of the 
regression function at a point. Binev et al promise a further paper where functions 
of general Lipschitz order are considered. 

Our point in this paper is mainly a philosophical one. We can unwittingly take 
advantage of low dimensional structure without knowing it. We do not give careful 
minimax arguments, but rather, partly out of laziness, employ the semi heuristic 
calculations present in much of the smoothing literature. 

Here is our setup. Let (Xj, Yi), (i = 1, 2, . . . , n) be i.i.d $t D+1 valued random vec- 
tors, where X is a D-dimensional predictor vector, Y is the corresponding univariate 
response variable. We aim to estimate the conditional mean mo(x) = E(Y\X = x) 
nonparametrically. Our crucial assumption is the existence of a local chart, i.e., 
each small patch of X (a neighborhood around x) is isomorphic to a ball in a d- 
dimensional Euclidean space, where d — d[x) < D may vary with x. Since we fix 
our working point x, we will use d for the sake of simplicity. The same rule applies 
to other notations which may also depend on x.) More precisely, let B^ r denote 
the ball in centered at z with radius r. A similar definition applies to B£ R . For 
small R > 0, we consider the neig hborhood of x, X x := B° R n X within X. We 
suppose there is a continuously differentiable bijective map (j> : Bq r i— > X x . Under 
this assumption with d < D, the distribution of X degenerates in the sense that it 
does not have positive density around x with respect to Lebesgue measure on $l D . 
However, the induced measure Q on Bq r defined below, can have a non-degenerate 
density with respect to Lebesgue measure on 3? d . Let S be an open subset of X x , 
and be its preimage in Then Q(Z G ^(S)) = P(X G S). We assume 

throughout that Q admits a continuous positive density function /(•). We proceed 
to our main result whose proof is given in the Appendix. 
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2. Local linear regression 



17j | develop the general theory for multivariate local polynomial regression in the 



usual context, i.e., the predictor vector has a D dimensional compact support in 
tft D . We shall modify their proof to show the " naive" (brute-force) multivariate local 
linear regression achieves the "oracle" convergence rate for the function m(4>(z)) on 

r>d 
°0,r- 

Local linear regression estimates the population regression function by a, where 
(a, (3) minimize 

n 

^(Yi-a- l3 T {X t - x)) 2 K h {Xi - x). 



Here Kh(-) is a D— variate kernel function. For the sake of simplicity, we choose the 
same bandwidth h for each coordinate. Let 



and W x = diag{Kh(Xi —x) 
function can be written as 



•l(*i-*) T - 

■l(X n -x) T - 
, Kh{X n — x)}. Then the estimator of the regression 



m(x, h) = el{X T x W x X x y x XlW x Y 
where ei is the [D + 1) x 1 vector having 1 in the first entry and elsewhere. 



2.1. Decomposition of the conditional MSE 

We enumerate the assumptions we need for establishing the main result. Let M be 
a canonical finite positive constant, 

(i) The kernel function K(-) is continuous and radially symmetric, hence bound- 
ed. 

(ii) There exists an e(0 < e < 1) such that the following asymptotic irrelevance 
conditions hold. 



E 



ir { x±) w (x)i(x g (B£ hl _. n x) c )] = o(^+ 2 ) 



for 7 = 1,2 and \w(x)\ < M(l + |a;| 2 ). 

(iii) v(x) = Var(Y\X = x) < M. 

(iv) The regression function m(x) is twice differentiable, and || q x ™ b ||oo < M for 
all 1 < a < b < D if x — (x%, . . . , xd)- 

(v) The density /(•) is continuously differentiable and strictly positive at in 

ffd 

Condition (ii) is satisfied if K has exponential tails since if V = x ^ x , the conditions 
can be written as 

E[K~<(V)w(x + hV)l(V e {B° hl -.r] = o(h d+2 ). 
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Theorem 2.1. Let x be an interior point in X . Then under assumptions (i)-(v), 
there exist some J\(x) and J%{x) such that 

E{m(x, h) - m(x)\Xi, ...,X n } = h 2 Ji(x)(l + o P (l)), 

Var{m(x, h) - m{x)\X u . . .,X n } = n - 1 h- d J 2 (x)(l + o P (l)). 

Remark 1. The predictor vector doesn't need to lie on a perfect smooth manifold. 
The same conclusion still holds as long as the predictor vector is "close" to a 
smooth manifold. Here "close" means the noise will not affect the first order of our 
asymptotics. That is, we think of Xx, . . . ,X n as being drawn from a probability 
distribution P on K D concentrated on the set 

X = {y : \4>(u) — y\ < e n for some u e B$ r } 

and e ra — > with n. It is easy to see from our arguments below that if e n = o(h), 
then our results still hold. 

Remark 2. When the point of interest x is on the boundary of the support X, we 
can show that the bias and variance have similar asymptotic expansions, following 



the Theorem 2.2 in [17|. But, given the extra complication of the embedding, the 



proof would be messier, and would not, we believe, add any insight. So we omit it. 



2. 2. Extensions 

It's somewhat surprising but not hard to show that if we assume the regression 
function m to be p times diffcrcntiable with all partial derivatives of order p bounded 
(p > 2, an integer), we can construct estimates m such that, 

E{m(x, h) - m(x)\X u ...,X n } = h p ^(^(l + o P (l)), 

Var{m{x, h) - m{x)\X u . . .,X n } = n^h- 1 J 2 {x){\ + o P (l)) 

__2p_ 

yielding the usual rate of n 2 p+ d for the conditional MSE of m(x, h) if h is chosen 
optimal, h = An~ 2 p+ d . This requires replacing local linear regression with local 
polynomial regression with a polynomial of order p — 1. We do not need to estimate 
the manifold as we might expect since the rate at which the bias term goes to 
is derived by first applying Taylor expansion with respect to the original predictor 
components, then obtaining the same rate in the lower dimensional space by a first 
order approximation of the manifold map. Essentially all we need is that, locally, 
the geodesic distance is roughly proportionate to the Euclidean distance. 



3. Bandwidth selection 

As usual this tells us, for p = 2, that we should use bandwidth An -3 ^ to achieve 

_ 2 

the best rate of n 4 + d . This requires knowledge of the local dimension as well as 
the usual difficult choice of A. More generally, dropping the requirement that the 
bandwidth for all components be the same we need to estimate d and choose the 
constants corresponding to each component in a simple data determined way. 

There is an enormous literature on bandwidth selection. There are three main 
approaches: plug-i n ( B, Fill Il8j . etc); the bootstrap (0, [ll], [I2|, etc) and cross 
validation ([y, [lfj, [22j, etc). The first has always seemed logically inconsistent to 
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us since it requires higher order smoothness of m than is assumed and if this 
higher order smoothness holds we would not use linear regression but a higher 
order polynomial. See also the discussion of [23j |. 

We propose to use a blockwise cross-validation procedure defined as follows. Let 
the data be (Xi,Yi), 1 < i < n. We consider a block of data points {(Aj, Yj) : j G 
J}, with \ J\ — ri\. Assuming the covariates have been standardized, we choose the 
same bandwidth h for all the points and all coordinates within the block. A leave- 
one-out cross validation with respect to the block while using the whole data set 
is defined as following. For each j G J, let m^j^{Xj) be the estimated regression 
function (evaluated at Xj) via local linear regression with the whole data set except 
Xj. In contrast to the usual leave-one-out cross-validation procedure, our modified 
leave-one-out cross-validation criterion is defined as mCV(h) = ^J^jejO^j 



^Xj)) 2 . Using a result from [23j], it can be shown that 



mCV{h) = — 



1 x . (^-m h (X,-)) 2 



(i-S h (j,j)) 2 

where Sh(j,j) is the diagonal element of the smoothing matrix Sh- We adopt the 
GCV idea proposed by [4j and replace the Sh{j,j) by their average atrj(Sh) — 
7T7 ^jeJ ^htiij)- Thereby our modified generalized cross-validation criterion is, 



iGCV(h) = - E 77 
n i r-C (1 



(^■-m^A,)) 2 



j (l-atr J (S h )r- 

The bandwidth h is chosen to minimize this criterion function. 

We give some heuristics for the justifying the (blockwise homoscedastic) mGCV. 
In a manner analogous to [23(, we can show 

In view of (|A.2p in the Appendix, we see Sh(j,j) = n^ 1 h~ d K(0)(Ai(X j ) + o p (l)). 
Thus as n- x h- d -> 0, 

atrj(S h ) = rr l h- d K(Q)(rq 1 £ A 1 {X j ) + o p (l)) 

= O p (n- 1 h- d ) = o p (l). 

Then, as is discussed in (2^], using the approximation (1 — x)~ 2 w 1 + 2x for small 
x, we can rewrite mGCV(h) as 

mGCV(h) = — V(r, - m^A,)) 2 + —trj(S h )—Y{Y 3 - m h (Xj)) 2 . 
n\ ni n\ 

Now regarding A- J2j£j0^j ~ ^hiXj)) 2 in the second term as an estimator of the 
constant variance for the focused block, the mGCV is approximately the same as 
the C p criterion, which is an estimator of the prediction error up to a constant. 



In practice, we first use 14j's approach to estimate the local dimension <i, which 

yields a consistent estimate d of d. Based on the estimated intrinsic dimensionality 

i i 

d, a set of candidate bandwidths CB = {Xin d "+ 4 , . . . , Xsn d + 4 } (Ai < • • • < A#) 
are chosen . We pick the one minimizing the mGCV (h) function. 
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4. Numerical experiments 

The data generating process is as following. The predictor vector X = (X^yX^y 
X( 3 )), where X^ will be sampled from a standard normal distribution, X( 2 ) = 
X^ + sin(X^) — 1, and X( 3 ) = log(X^ + 1) — X(iy The regression function 

m(x) = m(x(iyX( 2 yX( 3 y) — cos(x^) + xp) — x 2 3 y The response variable Y is 
generated via the mechanism Y = m(X) + e, where s has a standard normal 
distribution. By definition, the 3-dimensional regression function m(x) is essentially 
a 1-dimensional function of x^y n = 200 samples are drawn. The predictors are 
standardized before estimation. We estimate the regression function mix) by both 
the "oracle" univariate local linear (ull) regression with a single predictor Xn\ and 
our blind 3-variate local linear regression with all predictors Xny Xr 2 y X/ 3 y 

We focus on the middle block with 100 data points, with the number of neighbor 
parameter k, needed for Levina and Bickel's estimate, set to be 15. The intrinsic 
dimension estimator is d = 1.023, which is close to the true dimension, d = 1. 
We use the Epanechnikov kernel in our simulation. Our proposed modified GCV 
procedure is applied to both the ull and mil procedures. The estimation results are 
displayed in Figure 1. The x — axis is the standardized Xny From the right panel, 
we see the blind mil indeed performs almost as well as the "oracle" ull. 

Next, we allow the predictor vector to only lie close to a manifold. Specifically, 
we sample X (1) = X' (1) +e' 1 ,X {2) = Xfc + sin{X' (l) ) - 1 + e' 2 , X (3) = logpf^ + 1) - 
X',^s + 63, where X',^ is sampled from a standard normal distribution, and e'i,e 2 
and e' 3 are sampled from 7V(0, a' 2 ). The noise scale is hence governed by a'. In 
our experiment, a' is set to be 0.02, 0.04, . . . , 0.18, 0.20 respectively. The predictor 
vector samples are visualized in the left panel of Figure 2 with a' = 0.20. In the 
maximum noise scale case, the pattern of the predictor vector is somewhat vague. 
Again, a blind "mil" estimation is done with respect to new data generated in the 
aforementioned way. We plot the MSEs associated with different noise scales in the 
right panel of Figure 2. The moderate noise scales we've considered indeed don't 
have a significant influence on the performance of the "mil" estimator in terms of 
MSE. 




-25 -2 -1 5 -1 -05 0.5 1 1.5 2 2.5 -0.8 -0.6 -0 4 -0.2 02 0.4 06 0.8 



Fig 1. The case with perfect embedding. The left panel shows the complete data and fitting of 
the middle block by both univariate local linear (ull) regression and multivariate local linear (mil) 
regression with bandwidths chosen via our modified GCV. The focused block is amplified in the 
right panel. 
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Fig 2. The case with "imperfect" embedding. The left panel shows the predictor vector in a 3-D 
fashion with the noise scale a' = 0.2. The right panel gives the MSEs with respect to increasing 
noise scales. 



Appendix 

Proof of Theorem 2.1. Using the notation of [17j, H m (x) is the D x D Hessian 
matrix of m(x) at x, and 

Q m {x) = [(X! - xfHmWiX! - x ),---,{X n - x) T H m (x)(X n - x)f. 

Ruppert and Wand have obtained the bias term. 



(A.l) 



E(rh(x, h) ~ m(x)\Xi,- ■ ■ ,X n ) 



\fl{XlW x X x y x XlW x {Q m {x) + R m {x)} 



where if | • | denotes Euclidean norm, |i? m (a;) is of lower order than |Q m (a;)|. Also 
we have 



n- x XTW x X x 



ELiKhiXt-xjiX^x) 1 



_1 EiLi Kh(X { - x){Xi - x) n- 1 Y:7=i K h{Xi ~ x){Xi - x)(Xi - x) J 
The difference in our context lies in the following asymptotics. 
EK h (Xi - x) = E[K h (X t - x)l(Xi e B° hX -< n X)} 

+E[K h {x t - x)i(x t e (B£ fc i_. n x) c )} 

\ .1 jyd 



= h d - D (f(0) [ K{V(j){0)u)du + op(l)) 
= h d - D {A x {x)+o P {\)). 
Thus, by the LLN, we have 

n 

n- 1 K h (Xi -x) = h d - D (A, (x) + o P (l)) . 
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Similarly, there exist some A 2 {x) and A%(x) such that 

n 

n- 1 K h (X t - x){Xi -x) = h 2+d - D {A 2 (x) + o P (l)) 

i=l 

and 

n 

n- 1 K h{Xi - x)(Xi - x)(Xi - x) T = h 2+d - D (A 3 (x) + o P (l)) 
i=i 

where we used assumption (i) to remove the term of order h 1+d ~ D in deriving 
the asymptotic behavior of n^ 1 Kh(Xi — x)(Xj — x). Invoking Woodbury's 
formula, as in the proof of Lemma 5.1 in [13j |. leads us to 

-A 1 (x)- 1 +o P (l) P (1) 



(A.2) 



Op(1) h- 2 O p (l) 



On the other hand, 

n~ x X x W x Q m (x) 



1 Eti K h{Xi - x)(Xi - x) T H m {x){X t - x) 
-n- 1 Y, n i=x{ K h{Xi ~ x){Xi - x) T H m (x)(X z - x)}(X z - x) 

In a similar fashion, we can deduce that for some Bi(x),B2(x) i 

n 

n- 1 K h {Xi - x)(Xi - x) T H m {x){X t - x) = h 2+d - D (B 1 (x) + o P (l)) 



i=i 



and 



n 



-^{KhiXi -x)(Xi -x) T ft m (x)(X ( -x)}(Xi -x) = h 3+d - D (B 2 (x)+o P (l)). 



We have 
(A.3) 



n 1 X x W x Q m (x) = h 



d-D 



h 2 (B 1 (x) + o P (l)) 
h 3 (B 2 (x)+o P (l)) 



It follows from (|A.1|) , (|A.2[) and (|A.3|) that the bias admits the following approxi- 
mation. 

(A.4) E(m(x,h) -m(x)\X u ...,X n ) = h 2 A^x)' 1 B^x) + o P {h 2 ). 
Next, we move to the variance term. 



(A.5) 



Var{m(x, h)\X 1 , . . . , X n } 

= e T 1 (X T x W x X x y 1 X T x W x VW x X x {XlW x X x )- 1 e 1 . 

The upper-left entry of n' 1 X^W x VW x X x is 



n- 1 K h{Xi - x) 2 v(X t ) = h d - 2D C x {x){\ + op(l)). 
i=l 

The upper-right block is 

n 

n- 1 K h {Xi - x) 2 (A, - x) T v(Xi) = h 1+d - 2D C 2 (x)(l + o P (l)) 
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and the lower-right block is 

n 

n- 1 J2 K h{Xi - xf{X l - x)(Xi - xfviX,) = h 2+d - 2D C 3 (x)(l + o P (l)). 

i=l 

In light of (|A.2j) . we arrive at 

(A.6) Var{rh(x, h)\X u . . .,X n } = n^h^A^x^dix)^ + o P (l)). 

The proof is complete. □ 
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